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Abstract 

Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have 
become a powerful tool for modeling phenomena in a wide range of applications 
including laboratory, astrophysical, and space plasmas. In general, high-resolution 
methods for solving the ideal MHD equations are computationally expensive and 
Beowulf clusters or even supercomputers are often used to run the codes that imple- 
mented these methods. With the advent of the Compute Unified Device Architecture 
(CUDA), modern graphics processing units (CPUs) provide an alternative approach 
to parallel computing for scientific simulations. In this paper we present, to the au- 
thor's knowledge, the first implementation of MHD simulations entirely on CPUs 
with CUDA, named CPU-MHD, to accelerate the simulation process. CPU-MHD 
supports both single and double precision computation. A series of numerical tests 
have been performed to validate the correctness of our code. Accuracy evaluation 
by comparing single and double precision computation results is also given. Per- 
formance measurements of both single and double precision are conducted. These 
measurements show that our CPU-based implementation achieves speedups (in sin- 
gle precision) of about 10 (ID problems with 4096 grid points), 200 (2D problems 
with 1024^ grid points), and 84 (3D problems with 128^ grid points), respectively, 
compared to the corresponding serial CPU MHD implementation. For double preci- 
sion computation, CPU-MHD still can achieve about 60% speed of the correspond- 
ing single precision computation. In addition, we extend CPU-MHD to support 
the visualization of the simulation results and thus the whole MHD simulation and 
visualization process can be performed entirely on CPUs. 
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1 Introduction 



Magnetohydrodynamic (MHD) equations can be used in modeling phenom- 
ena in a wide range of applications including laboratory [5] , astrophysical [39] , 
and space plasmas [9]. However, they form a nonlinear system of hyperbolic 
conservation laws, which is so complex and high-resolution methods are nec- 
essary to solve them in order to capture shock waves and other discontinu- 
ities. These high-resolution methods are in general computationally expen- 
sive and parallel computational resources such as Beowulf clusters or even 
supercomputers are often utilized to run the codes that implemented these 
methods [22] [13] [18] [14] [50]. 

In the last few years, the rapid development of graphics processing units 
(GPUs) makes them more powerful in performance and more programmable 
in functionality. By comparing the computational power of GPUs and CPUs, 
GPUs exceed CPUs by orders of magnitude. The theoretical peak performance 
of the current consumer graphics card NVIDIA GeForce GTX 295 (with two 
GPUs) is 1788. 48G floating-point operations per second (FLOPS) per GPU 
in single precision while a CPU (Core 2 Quad Q9650 — 3.0 GHz) gives a 
peak performance of around 96GFLOPS in single precision. The release of 
the Compute Unified Device Architecture (CUDA) [24] hardware and software 
architecture is the culmination of such development. With CUDA, one can 
directly exploit a GPU as a data-parallel computing device by programming 
with the standard C language and avoid working with a high-level shading lan- 
guage such as Cg [21], which requires a significant amount of graphics specific 
knowledge and was previously used for performing computation on GPUs. De- 
tailed performance studies on GPUs with CUDA can be found in [4] and [32] . 

CUDA is a general purpose parallel computing architecture developed by 
NVIDIA. It includes the CUDA Instruction Set Architecture (ISA) and the 
parallel compute engine. An extension to C programming language and its 
compiler are provided, making the parallelism and high computational power 
of GPUs can be used not only for rendering and shading, but also for solving 
many computationally intensive problems in a fraction of the time required 
on a CPU. CUDA also provides basic linear algebra subroutines (CUBLAS) 
and fast Fourier transform (CUFFT) libraries to leverage CPUs' capabilities. 
These libraries release developers from rebuilding the frequently used basic op- 
erations such as matrix multiplication. Graphics cards from G8x series support 
the CUDA programming mode; and the latest generation of NVIDIA GPUs 
(GT2xO series or later) unifies vertex and fragment processors and provides 
shared memory for interprocessor communication. 

A increasing number of new GPU implementations with CUDA in different as- 
trophysical simulations have been proposed. Belleman et al. [2] re-implemented 
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the direct gravitational iV-body simulations on GPUs using CUDA. For N > 
10^, they reported a speedup of about 100 compared to the host CPU and 
about the same speed as the GRAPE-6Af. A library Sapporo for perform- 
ing high precision gravitational A'^-body simulations was developed on GPUs 
by Gaburov et al. [10]. This library achieved twice as fast as commonly used 
GRAPE6A/GRAPE6-BLX cards. Stantchev et al. [37] [38] implemented a 
Particle-ln Cell (PIC) code on GPUs for plasmas simulations and visualiza- 
tions and demonstrated a speedup of 11-22 for different grid sizes. Sainio [31] 
presented an accelerated GPU cosmological lattice program for solving the 
evolution of interacting scalar fields in an expanding universe, achieving speedups 
between one and two orders of magnitude in single precision. In the above 
works, no discussion on using double precision on GPUs was reported. In 
MHD simulations, the support of double precision is important, especially for 
nonlinear problems. We will evaluate the performance and accuracy of double 
precision on GPUs in this work. 

In this paper, we present an efficient implementation to accelerate computa- 
tion of MHD simulations on GPUs, called GPU-MHD. To our knowledge, this 
is the first work describing MHD simulations on GPUs in detail. The goal of 
our work is to perform a pilot study on numerically solving the ideal MHD 
equations on GPUs. In addition, the trend of today's chip design is moving 
to streaming and massively parallel processor models, developing new MHD 
codes to exploit such architecture is essential. GPU-MHD can be easily ported 
to other many-core platforms such as Intel's upcoming Larrabee [34], making 
it more fiexible for the user's choice of hardware. This paper is organized as 
follows: A brief description of the CUDA programming model is given in Sec- 
tion 2. The numerical scheme in which GPU-MHD adopted is presented in 
Section 3. In Section 4, we present the GPU implementation in detail. Nu- 
merical tests are given in Section 5. Accuracy evaluation by comparing single 
and double precision computation results is given in Section 6. Performance 
measurements are reported in Section 7 and visualization of the simulation 
results is described in Section 8. We conclude our work and point out the 
future work in Section 9. 



2 A brief description of the CUDA 

The Compute Unified Device Architecture (CUDA) was introduced by NVIDIA 
as a general purpose parallel computing architecture, which includes GPU 
hardware architecture as well as software components (CUDA compiler and 
the system drivers and libraries). The CUDA programming model [24] consists 
of functions, called kernels, which can be executed simultaneously by a large 
number of lightweight threads on the GPU. These threads are grouped into 
one-, two-, or three-dimensional thread blocks, which are further organized into 
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one- or two-dimensional grids. Only threads in the same block can share data 
and synchronize with each other during execution. Thread blocks are inde- 
pendent of each other and can be executed in any other. A graphics card that 
supports CUDA, for example, the GT200 GPU [20], consisting of 30 streaming 
multiprocessors (SMs). Each multiprocessor consists of 8 streaming processors 
(SPs), providing a total of 240 SPs. Threads arc grouped into batches of 32 
called warps which are executed in single instruction multiple data (SIMD) 
fashion independently. Threads within a warp execute a common instruction 
at a time. 



For memory access and usage, there are four types of memory, namely, global 
memory, constant memory, texture memory as well as shared memory. Global 
memory has a separate address space for obtaining data from the host CPU's 
main memory through the PCIE bus, which is about 102 GB/sec in the GT200 
GPU. Any valued stored in global memory can be accessed by all SMs via load 
and store instructions. Constant memory and texture memory are cached, 
read-only and shared between SPs. Constants that are kept unchange during 
kernel execution may be stored in constant memory. Built-in linear interpo- 
lation is available in texture memory. Shared memory is limited (16 kB for 
GT200 GPU) and shared between all SPs in a MP. For detailed information 
concerning memory optimizations, we refer the reader to "CUDA Best Prac- 
tice Guide" [26]. 



Double precision is one important concern in many computational physics 
applications, however, supports of double precision are limited to the NIVDIA 
cards having Compute Capabihty 1.3 (See Appendix A in [24]) such as the 
GTX 260, GTX 280, Quadro FX 5800 (contains one GT200 GPU), and Tesla 
C1060 (contains one GT200 GPU) and S1070 (contains four GT200 CPUs). In 
GT200 GPU, there are eight single precision floating point (FP32) arithmetic 
logic units (ALUs) (one per SP) in SM, but only one double precision floating 
point (FP64) ALU (shared by eight SPs). The theoretical peak performance 
of GT200 GPU is 936 GFLOPS in single precision and 78 GFLOPS in double 
precision. In CUDA, double precision is disabled by default, ensuring that 
all double numbers are silently converted into float numbers inside kernels 
and any double precision calculations computed are incorrect. In order to 
use double precision floating point numbers, we need to call nvcc: "-arch = 
sm_13'\ The flag "-arch = siii_13" in the command tells "nvcc" to use the 
Compute Capability 1.3 which means enabling the double precision support. 



In Sections 6 and 7 we compare the accuracy and actual performance of GPU- 
MHD in single and double precision. 
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3 Numerical scheme 



The ideal MHD equations with the assumption of the magnetic permeabil- 
ity = 1 can be represented as hyperbolic system of conservation laws as 
follows [11] 

^ + V- ipv) = (1) 

^ + V- (pvv - BB) + VP* = (2) 

ot 

BB 

— -Vx{vxB)^0 (3) 
— + V-{{E + nv-B{B.v))=0 (4) 



Here, p is the mass density, pv the momentum density, B the magnetic field, 
and E the total energy density. The total pressure P* = P + ^ where P is the 
gas pressure that satisfies the equation of state, F = (7 — — p^ — 
In addition, the MHD equations should obey the divergence-free constraint 
V • B = 0. 

Over the last few decades, there has been a dramatic increase in the number of 
publications on numerical solution of ideal MHD equations. In particular the 
development of shock-capturing numerical methods for ideal MHD equations. 
We do not provide an exhaustive review of the literature here. A comprehen- 
sive treatment of numerical solution of MHD equations can be found in [17], 
for example. Pen et al. [28] proposed a free, fast, simple, and efficient total vari- 
ation diminishing (TVD) MHD code featuring modern high-resolution shock 
capturing on a regular Cartesian grid. This code is second-order accuracy in 
space and time and enforces the V ■ B — constraint to machine precision. 
Due to these advantages and convenience for GPU verse CPU comparison, the 
underlying numerical scheme in GPU-MHD is based on this work. A detailed 
comparison of shock capturing MHD codes can be found in [43], for exam- 
ple. We plan to explore other recent high-order Godunov schemes such as [19] 
and [41] for GPU-MHD as our future work. 

We briefly review the numerical scheme [28] we adopted in GPU-MHD here. 
In this numerical scheme, the magnetic field is held fixed first and then the 
fluid variables are updated. A reverse procedure is then performed to complete 
a one time step. Three dimensional problem is split into one-dimensional sub- 
problems by using a Strang-type directional splitting [42] . 

Firstly, we describe the fluid update step in which the fluid variables are up- 
dated while holding the magnetic field fixed. The magnetic field is interpolated 
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to cell centers for second-order accuracy. By considering the advection along 
the X direction, the ideal MHD equations can be written in flux-conservative 
vector form as 

du dF{u) 



dt 



+ 



dx 



= 



(5) 



where the flux vector is given by 



pVx 
pvl + P* 

PVxVy 
PVxVz 



*-Bl 

BxBy 



\{E + P*)vx- 



BxB ■ V 



(6) 



Equation (5) is then solved by Jin & Xin's relaxing TVD method [15]. With 
this method, a new variable w — F{u)/c is defined, where c{x,t) is a free 
positive function called the flux freezing speed. For ideal MHD equations, we 
have u — {ui, U2, u^, u^, u^) — (p, pVx, pvy, pvz, E) and equations 



du <9 , , 
dw d , , 

+ 







(7) 

(8) 



These equations can be decoupled through a change of left- and right-moving 
variables = {u + w)/2 and = {u — w)/2 



The above pair of equations is then solved by an upwind scheme, separately for 
right- and left-moving waves, using cell-centered fluxes. Second-order spatial 
accuracy is achieved by interpolating of fluxes onto cell boundaries using a 
monotone upwind schemes for conservation laws (MUSCL) [47] with the help 
of a flux limiter. Runge-Kutta scheme is used to achieve second-order accuracy 
of time integration. 

We denote as the cell-centered values of the cell n at time t, F^^ as the 
cell-centered flux in cell n. As an example, we consider the positive advection 
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velocity, negative direction can be obtained in a similar way. We obtain the 
first-order upwind flux F^^^'y^ from the averaged flux F/^ in cell n. Two second- 
order flux corrections can be deflned using three local cell-centered fluxes as 
follows 

= " 3 ""^ (11) 

= ^"^^ ~ (12) 



When the corrections have opposite signs,there is no second-order correction 
in the case of near extrema. With the aid of a flux limiter (p we then get the 
second-order correction 

An+i/2 - 0(AF^i/2' AFfi/2) (13) 
The van Leer limiter [46] 

vanleer(a, b) — (14) 

is used in GPU-MHD. By adding the second-order correction to the flrst-order 
fluxes we obtain second-order fluxes. For example, the second-order accurate 
right-moving flux i^„4i/2 t)e calculated 

= K + An+1/2 (15) 

The time integration is performed by calculating the fluxes -F'(w^) and the 
freezing speed cj^ in the first half time step is given as follows 

^..Mf.^^.^_^KiMizI^^^ (16) 

where-F^_,_^y2 = -^n+i/2~-^n+i/2 computed by the first-order upwind scheme. 
By using the second-order TVD scheme on we obtain the full time 

step wjj"^^* 

' pf+At/2 _ j^t+At/2- 

^ 1 , (17) 

To keep the TVD condition, the flux freezing speed c is the maximum speed 
information can travel and should be set to + (7p/p + -B^/p)^/^ as the 
maximum speed of the fast MHD wave over all directions is chosen. As the 
time integration is implemented using a second-order Runge-Kutta scheme. 
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the time step is determined by satisfying the CFL condition 



Cmax = [max{\v^\, \vy\, \vz\) + i'jp/p + B^/p)^/'^] 



(18) 



where cfl is the Courant-Number and cfl ^ 1 is generaUy set to cfl ~ 0.7 for 
stabihty, and B is the magnitude of the magnetic field. Constrained transport 
(CT) [8] is used to keep the V • S = to machine precision. Therefore, the 
magnetic field is defined on cell faces and it is represented in arrays [28] 



Bx{i,j,k) = (fix)i-i/2j,fc 
By{i,j,k) = {By)ij_i/2,k 
Bz{i,j,k) = (S^)ij,fc_i/2 



(19) 



where the cell centers are denoted by {i,j,k) = {xi,yj,Zk), and faces by {i ± 
1/2, j, k), ± 1/2, k), and k ± 1/2), etc. The cells have unit width for 
convenience. 

Secondly, we describe the update of the magnetic field in separate two-dimensional 
advection-constraint steps along a;-direction while holding the fiuid variables 
fixed. The magnetic field updates along y and z-directions can be handled in 
a similar matter. We follow the expressions used in [16]. For example, we can 
calculate the averaging of v along x direction as follows 



x)i,j+l/2,k — ^ [('^a;)i+ij+i/2,fe + ^ ('ya;)ij+i/2,jk + ('^a;)i_i j+i/2,jt 

A first-order accurate fiux is then obtained by 

i'"^By)i,j+l/2,k ' Mi+l/2,j+l/2,k > 



(20) 



i+l/2,j+l/2,k 



— < 



(21) 



(^'x52/)j+lj+l/2,A; > {V^)i+l/2,j+l/2,k 



< 



where the velocity average is 



{Vx)i+l/2,j+l/2,k — 2 



u 



x)i,j+l/2,k 



+ {Vx) 



i+l,j+l/2,k 



(22) 



B^ is updated by constructing a second-order-accurate upwind electromotive 
force (EMF) VyB^ using Jin & Xin's relaxing TVD method [15] in the ad- 
vection step. Then this same EMF is immediately used to update By in the 
constraint step. 

Extension to three dimensions can be done through a Strang-type directional 
splitting [42]. Equation (5) is dimensionally split into three separate one- 
dimensional equations. For a time step At, let fluid^ be the fiuid update 
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along X. Bx^y be the update of along y, and Li be the update operator 
of to by including the flux along i direction. Each Li includes three 

update operations in sequence, for example, includes fluid^, By^^, and 
Bz^x- A forward sweep and a reverse sweep are defined 

and u*"^^^* = LxLyL^u^'^^*, respectively. A complete update combines a for- 
ward sweep and reverse sweep. The dimensional splitting of the relaxing TVD 
can be expressed as follows [45] 

(23) 
(24) 
(25) 

where Ati, At2, and Ats are sequential time steps after each double sweep. 



W*2 


_ ^ti+2Ati 


— Lr^LyLzLzLyL^U^^ 






— LzLy.LyLyLxL^v^'^ 




_ ^t3+2At3 


— LyLzL^Ly.LzLy'u}'^ 



4 GPU implementation 



In this section, we provide the implementation details of GPU-MHD. With 
GPU-MHD, all computations are performed entirely on GPUs and all data is 
stored in the GRAM of the graphics card. Currently, GPU-MHD works on a 
regular Cartesian grid and supports both single and double precision modes. 



4-1 Data storage arrangement 



The most intuitive way to write a parallel program to solve a multidimen- 
sional problem is to use multidimensional arrays for data storage and mul- 
tidimensional threads for computation. However, the ability of the current 
CUDA is limited in supporting multidimensional threads, therefore, we could 
not implement our code in such a straightforward way. Especially in three 
dimensions or higher dimensions, there are still some limitations in handling 
multidimensional arrays and multidimensional threads. As a result, the most 
primitive way is to store data in one-dimension and perform parallel computa- 
tion with one-dimension threads. By using an indexing technique, our storage 
and threading method can be extended to to solve multidimensional problems. 
Our data storage arrangement is expressed in Fig. 1 and in Equations (26) to 
(28). 

INDEX^ = index / {SIZEy x SIZEz) 
< INDEXy = [index mod {SIZEy x SIZEz)]/SIZEz (26) 
INDEXz = index mod SIZEz 
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X + 1 X + 1 



z + 1 




y 
+ 



1 



(0, 1, 0) (0, 2, 1) 



0) 



Fig. 1. Mapping from 3D array to ID array in column major. 



INDEX^ ± 1 



index ± {SIZEy x SIZE,) 



(27) 



INDEX,, ± 1 



index ± SIZE, 



(28) 



INDEX, ± 1 



index ± 1 



(29) 



Here INDEX^, INDEXy, and INDEX, are the indexes of a 3D matrix. 
index is the ID index used in GPU-MHD, SIZEy, and SIZE, are the matrix 
size (number of grid points in our study) of a 3D matrix. 

Equation (26) expresses the mapping of three-dimensional indexes to one- 
dimensional indexes. Equations (28) to (29) express the shift operations. Shift 
operations are very important in numerical solution of conservation laws be- 
cause some calculations are based on the neighboring grid points. 

4 ■ 2 Program flow 

A "CUDA kernel" is a function running on GPU [24]. Noted that the CUDA 
kernel will process all grid points in parallel, therefore, a For instruction is 
no needed for going through all grid points. GPU-MHD includes the following 
steps: 

(1) CUDA initialization 

(2) Setup the initialize condition for the specified MHD problem: 

u = (mi, ^2, Ms, ^4, Ms) of all grid points, B = {B^, By, B,) of cell faces, 
and set parameters such as time t, etc. 

(3) Copy the the initialize condition u, B to device memory (CUDA global 
memory) 

(4) For all grid points, calculate the Cmax by Equation (18) (implemented 



with a CUDA kernel) 
(5) Use cublaslsamax (in single precision mode) function or cublasldamax 
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(in double precision mode) function of the CUBLAS library to find out 
the maximum value of all Cmax, and then determine the At 

(6) Since the value of At is stored in device memory, read it back to host 
memory (RAM) 

(7) Sweeping operations of the relaxing TVD (Calculation of the Lj,i = 
X, y, z, implemented with several CUDA kernels, will be explained in the 

next subsection) 

(8) t = t + 2At 

(9) If t reaches the target time, go to next step 
else repeats the procedure from step (4) 

(10) Read back data u, B to host memory 

(11) Output the result 

The program flow of GPU-MHD is shown in Fig. 2. After the calculation of 
the CFL condition, the sweeping operations will be performed. The sweeping 
operation Li will update both the fluid variables and orthogonal magnetic 
fields along i dimension. This is a core computation operation in the relaxing 
TVD scheme described in Section 3. 

The CFL condition for the three-dimensional relaxing TVD scheme is obtained 
by Equation (18). The procedure is to calculate all the Cmax of each grid point 
and find out the maximum value. In GPU-MHD^ parallel computation power 
of CUDA is exploited to calculate the Cmax of each grid point in parallel and 
all the Cmax values are stored in a matrix. Then the cublaslsamax function 
is used (in double precision mode, the cublasldamcix function is used) to 
find out the maximum Cmax of the matrix in parallel (called the reduction 
operation). The cublaslsamax function is provided in the CUBLAS library 
— a set of basic operations for vector and matrix provided by NVIDIA with 
the CUDA toolkit [24] . The reason we read the At back and store both At and 
t in host memory is due to the data in device memory cannot be printed out 
directly in the current CUDA version. This information is useful for checking 
if there is any problem during the simulation processing. The implementation 
of sweeping operations will be explained in the next subsection. 



4.3 Sweeping operations 

Before we start to describe the sweeping operations, consideration of memory 
arrangement is presented first in the following. 

Implementing parallel computation using CUDA kernels is somewhat similar 
to parallel implementation on a CPU-cluster, but it is not the same. The major 
concern is the memory constrain in CPUs. CUDA makes parallel computation 
process on CPUs which can only access their graphics memory (GRAM). 



11 



System Initialization (CUDA OpenGL) 



MHD Test/Problem Initialization 
(Iniual conditions/boundary conditions) 



i 

Calculation of CFL condition and find 

out d suitable ttt 

Time ( = f+dt, time step ~ 1 




-t fTiTI — i- rLjri ► fXiTI — ►[XT] » fTy~| — C^H" 

1 H~La1 - fUl ^~Lr\ . fton " fUl H"!^ 

2 UjTF] — ►[m — Ki^D — — — '{S} 

Fig. 2. The flow chat of GPU-MHD. 



Therefore, data must be stored in GRAM in order to be accessed by GPUs. 
There are several kinds of memory on graphics hardware including registers, 
local memory, shared memory, and global memory, etc., and they have different 
characteristics and usages [24], making memory management of CUDA quite 
different compared to parallel computation on a CPU-cluster. In addition, 
even the size of GRAM in a graphics card increases rapidly in newer models 
(for example, the latest NVIDIA graphics card — GeForece GTX 295 has 
1.75G GRAM), but not all the capacity of GRAM can be used to store data 
arbitrarily. Shared memory and local memory are flexible to use, however, 
their sizes are very limited in a block and thus they cannot be used for storing 
data with large size. In general, numerical solution of conservation laws will 
generate many intermediate results (for example, ■u*"'"^*/^, F, c, w, etc.) during 
the computation process, these results should be stored for subsequent steps 
in the process. Therefore, global memory was mainly used in GPU-MHD. 

After the maximum value of Cmax in Equation (18) is found, we can get the At 
by determining the Cour ant- Number (c//). The sequential step is the calcu- 
lation of Li {i = X, y, z). The implementation of Lj includes two parts: update 
the fluid variables and update the orthogonal magnetic fields. As an example, 
the process for calculating is shown in Fig. 3 where each block was imple- 
mented with one or several CUDA kernels. The process for calculating Ly or 
Lz is almost the same as except that the dimensional indexes are different. 
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Fig. 3. Calculation process of L^. 



The first part of tlie L^, calculation process is fluids . The fluid variables will 
be updated along x. Algorithm 1 shows the steps and GPU kernels of this 
process (the data of u and B are already copied to device memory), all the 
steps are processed on all grid points with CUDA kernels in parallel. 

In this process, we have to calculate the magnetic fields of the grid point 
(Equation 19) first because all the magnetic fields are defined on the faces 
of the grid cell [28]. To update the fluid variables of L^, the main process, 
which includes one or even several CUDA kernels, is to calculate the affect of 
the orthogonal magnetic fields to the fluid variables of Equations (6), (9) and 
(10). One such main process gives the flux of the At/2 step. After two main 
processes of flux calculation and the other difference calculations, the value of 
fluid — w is updated from n* to in one process. 

The second part of the calculation process is to update the orthogonal 
magnetic fields in ?/-dimension (By^^), and 2;-dimension {B;^^^) with the fluid 
along x-dimension. The strategy and implementation are similar to those in 
the first part but with a different algorithm for the orthogonal magnetic fields. 
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Algorithm 1 Algorithm of fluid^, all equations and difference calculations 

are processed using CUDA kernels 
1: load u, B and At 

2: memory allocation for the storage of the intermediate results: B tempi Utemp, 
fluxtemp, othertemp, (otheFtemp iucludcs the storage of F, c, w, etc) 

3: Btemp results obtained by Equation (19) with B, {B stored the mag- 
netic field of the cell faces) 

4: othertemp ^ results obtained by Equations (6) and (9) with u 

5: fiuxtemp ^ the flux of a half time step: difference calculation ("First- 
Order Upwind Scheme of Fluid" CUDA kernels in Fig. 3) obtained by 
Equation (16) using othertemp 

6: Utemp ^ calculate the intermediate result (m*+'^*/^) using Equation (16) 
with u and fiuxtemp 

7: othertemp ^ results obtained by Equations (6) and (9) with Ut^rap (the 
same algorithm and same CUDA kernels in Step 4) 

8: fluxtemp •<— the flux of another half time step: difference calculation 
("Second-Order TVD Scheme of Fluid" CUDA kernels in Fig. 3) obtained 
by Equation (17) and the limiter (Equation (14)) using othertemp 

9: calculate the result of u^^^^ with fiuxtemp using Equation (17) and save 

it back to u 
10: free the storage of the intermediate results 

11: (continue to the second part of L-^, update the orthogonal magnetic fields) 



In Algorithm 1, the calculations in steps (4) to (9) are the steps for By^^, 
and steps (11) to (16) are the steps for B^^x- The steps for By^^ and B-^^^ 
are almost the same, and the only different parts are the dimensional indexes 
of the difference calculations, and the affected magnetic fields: By and B^. 
After the first part of the fiuid is updated to This change of the 

fiuid affects to the orthogonal magnetic fields. Therefore, the corresponding 
change (flux) of orthogonal magnetic flelds can be calculated with the density 
and velocity of the updated fluid Then the orthogonal magnetic flelds 

are also updated to B*^'^^ and -B*"^^*, and also, these changes give effects to 

After one process of L^, both fluid and magnetic flelds are updated to t + At 
with the affect of the flow in a;-dimension. And a sweeping operation sequence 
includes two L^, Ly, and (see Equations (23)). So we actually get the 
updated fluid and magnetic fields of t + 2At after one sweeping operation 
sequence. Note that the second in the sequence is a reverse sweeping oper- 
ation, the order of fluid^, By^^ and B^^^ has to be reversed: By^^ and B^^^ 
first, and fluid^ second. 

As we mentioned before, numerical solution of conservation laws needs lots 
of memory because there are many intermediate results generated during the 
computation process. These intermediate results should be stored for the next 
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Algorithm 2 Algorithm of {By^^) and (-Bz-j.^), all equations and difference 
calculations are processed using CUDA kernels 

1: (after the processes of fluid, wc obtain an updated u) 

2: load Ui (density p), U2 {pVx), B and At 

3: memory allocation for the intermediate results: B^g^p, fiuxtemp, vxtemp 

and VXface 

4: vxtemp determine the fluid speed with the updated Ui and U2 in fluid^, 

with the difference calculated in y-dimension 
5: VXface ^ Results obtained by Equation (20) 

6: fluxtemp ^ the flux of a half time step: difference calculation of "flux of 
magnetic fleld in y-dimension" ( "First-Order Upwind Scheme of Magnetic 
Field" CUDA kernels in Fig. 3) obtained by Equations (16) and (21)) 
7: Bternp calculate the intermediate result (1**+^*/^) by applying Equation 

(16) to By (not by applying Equation (16) to u) with By and fiuxtemp 
8: fluxtemp the flux of another half time step: difference calculation 
("Second-Order TVD Scheme of Magnetic Field" CUDA kernels in Fig. 3) 
obtained by Equation (16), the limitcr of Equation (14) and Equation (21) 
9: calculate the result of -B*^^* and with fluxtemp by applying Equa- 

tion (17)) to By, and save it back to B 

10: (the following steps is similar to above steps but the affected orthogonal 
magnetic fleld is changed from y to z) 

11: vxtemp determine the fluid speed with the updated Ui and U2 in fluid^, 
with the difference calculated in 2;-dimension 

12: VXface ^ Results obtained with Equation (20) using index of i, j, k + 1/2 

13: fluxtemp the flux of a half time step: difference calculation of "flux of 
magnetic fleld in ^-dimension" ( "First-Order Upwind Scheme of Magnetic 
Fhed" CUDA kernels in Fig. 3) obtained by Equations (16) and (21) 

14: btemp calculate the intermediate result (u*^'^*/^) by applying Equation 
(16) to Bz (not by applying Equation (16) to u) with B^ and fiuxtemp 

15: fluxtemp the flux of another half time step: difference calculation 
("Second-Order TVD Scheme of Magnetic Flied" CUDA kernels in Fig. 3) 
obtained by Equation (16), the limiter of Equation (14) and Equation (21) 

16: calculate the results of B^^^^ and B^^^^ with fluxtemp by applying Equa- 
tion (17)) to Bz- and save it back to B 

17: free the storage of the intermediate results 



calculation steps which need the information of the neighboring grid points 
obtained in the previous calculation steps. Otherwise, in order to avoid the 
asynchronous problem in parallel computation, we have to do many redundant 
processes. With the purpose to minimizing the memory usage, not only the 
calculation process of is divided into several steps (CUDA kernels), but also 
the intermediate results are stored as less as possible. The processes dealing 
with the difference calculations are also divided into several steps to minimize 
the storage of the intermediate results and to guarantee there is no wrong 
result caused by asynchronous problem. 
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It should be realized that most of the processes in the three-dimensional re- 
laxing TVD scheme with the dimensional splitting technique is similar. Pen 
et al. [28] swapped the data of x, and ^-dimensions while GPU-MHD used 
one-dimensional arrays. But the similar swapping technique can be applied in 
our case with some indexing operations. Instead of transposing or swapping 
the data, we implemented each calculation part of the flux computation with 
two sets of CUDA kernels: one set is the CUDA kernels for calculating the 
relaxing TVD scheme (we call it TVD kernel here) and the other set is the 
CUDA kernels actually called by Lj operations (we call them Lj kernels here). 
Indexing operations are contained in all Lj kernels. After the index is calcu- 
lated, TVD kernels are called and the indexes are passed to the TVD kernels, 
letting the TVD kernels calculate the flux of corresponding dimension. There- 
fore, the difference among L^., Ly, and is the dimensional index. The flux 
computation of GPU-MHD is shown in Fig. 4. 



Lx 



IMHD Flux Calculation 



X indexing of MHD 
Flux Calculation 



Ly 



MHD Flux Calculation 



A. 



Y indexing of MHD 
Flux Calculation 



Lz 



JL 



MHD Flux Calculation 



Z indexing of MHD 
Flux Calculation 



Same MHD Flux Calculation Kernel 



Fig. 4. Flux computation in GPU-MHD. 



The indexing operation swaps the target that will be updated and the neigh- 
boring relationship will also be changed accordingly. For example, the calcu- 
lation that uses x + 1 as the neighboring element in will be changed to 
y + 1 m Ly. As transposing the data in a matrix needs more processing time, 
it is efficient and flexible to extend the code to multidimensional by dividing 
the indexing operation and flux calculation. 

After the whole pipeline of Fig. 2 is completed, the MHD simulation results 
will be stored in GRAM and these results are readily to be further processed 
by the GPUs for visualization or read back to the CPU for other usage. 



5 Numerical tests 



In this section, several numerical tests in one-dimensional (ID), two-dimensional 
(2D), and three-dimensional (3D) for validation of GPU-MHD are given. The 
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results shown in this section are computed with single precision mode in GPU- 
MHD. The difference between single precision and double precision computa- 
tion results will be discussed in Section 6. 



5.1 One-dimensional problems 



5.1.1 Brio- Wu shock tube 

ID Brio-Wu shock tube problem [3] which is a MHD version of the Sod prob- 
lem [35] , consists of a shock tube with two initial equilibrium states as follows 
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Right side {x > 0.5) 
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Constant value of 7 = 2 was used and the problem was solved for x e [0, 1] 
with 512 grid points. Numerical results are presented ai, t — 0.08L in Fig. 5 
and Fig. 6, which include the density, the pressure, the energy, the y- and 
2;-magnetic field components, and the x-, y- and ^-velocity components. The 
results are in agreement with those obtained by Brio and Wu [3] and Zachary 
et al. [49]. 
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Fig. 5. Results (Part I) of Brio-Wu shock tube problem at t = 0.08L. The result 
computed with 512 grid points is shown with circles and solid line shows reference 
high resolution result of 4096 grid points. 

5.1.2 MHD shock tube 

The second ID test is the MHD shock tube problem considered in [6]. 
Left side {x < 0.5) 
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Fig. 6. Results (Part II) of Brio-Wu shock tube problem at t = 0.08L. The result 
computed with 512 grid points is shown with circles and solid line shows reference 
high resolution result of 4096 grid points. 

Right side (x > 0.5) 
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Constant value of 7 = 5/3 was used and the problem was solved for x G [0, 1] 
with 512 grid points. Numerical results are presented at t = 0.2L in Fig. 7 
and Fig. 8, which include the density, the pressure, the energy, the y- and 
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2;-magnetic field components, and the x-, y- and 2;-velocity components. The 
results are in agreement with those obtained by [6] and [30]. 



Resolution=512 
- Resolution=4096 




Resolution=512 
- Resolution=4096 



hfill 

Fig. 7. Results (Part I) of MHD shock tube test at t = 0.2L. The result com- 
puted with 512 grid points is shown with circles and solid line shows reference high 
resolution result of 4096 grid points. 



5.2 Two-dimensional problems 



5. 2. 1 Orszag- Tang problem 

The first 2D test is Orszag- Tang problem [27], which is used to study incom- 
pressible MHD turbulence. In our test, the boundary conditions are periodic 
everywhere. The density p, pressure p, initial velocities {vx,Vy,Vz), and mag- 
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Fig. 8. Results (Part II) of MHD shock tube test at t = 0.2L. The result com- 
puted with 512 grid points is shown with circles and solid line shows reference high 
resolution result of 4096 grid points. 
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The Orszag-Tang vertex test was performed in a two-dimensional periodic 
box with 512 x 512 grid points. The results of the density and gas pressure 
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evolution of the Orszag-Tang problem at t = 0.5L and t = l.OL are shown in 
Fig. 9, where the complex pattern of interacting waves is perfectly recovered. 
The results agree well with those in Lee et al. [19]. 




Fig. 9. Results of the density (top) and gas pressure (bottom) of Orszag-Tang vortex 
test at t = 0.5L (left) and t = l.OL (right) computed with 512 x 512 grid points. 



5.2.2 Two-dimensional blast wave problem 



The second 2D test is the MHD blast wave problem. The MHD spherical blast 
wave problem of Zachary et al. [49] is initiated by an over pressured region 
in the center of the domain. The result is a strong outward moving spherical 
shock with rarified fluid inside the sphere. We followed the test suite [36] 
of Athena [40]. The condition for 2D MHD blast wave problem is listed as 
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follows [36] 
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p = 5/(127r), 7 = 5/3 
spherical region center = (0.5, 0.5), r = 0.1 
(0 < a; < 1) (0 < y < 1) 

In Fig. 10, we present images of the density and gas pressure at t = 0.2L 
computed with 512 x 512 grid points. The results are in excellent agreement 
with those presented in [36]. 
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Fig. 10. Results of the density (left) and gas pressure (right) of the 2D blast wave 
test at i = 0.2L, computed with 512 x 512 grid points. 



5.2.3 MHD rotor problem 

The third 2D test is the MHD rotor problem. The problem was taken from [1]. 
It initiates a high density rotating disk with radius rg = 0.1 of fluid measured 
from the center point [x^y) = (0.5,0.5). The ambient fluid outside of the 



spherical region of ri = 0.115) has low density and Vx 



0, and the fluid 
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between the high density disk fluid and ambient fluid (ri > r > ro, where 
r = \J{^ ^ 0.5)^ + (?/ ~ 0-5)^) has linear density and angular speed profile 
with p = 1 + 'df.v^ = —fvQ{y — 0.5)/r and Vy = —fvo{x — 0.5)/r where 
f — {vi — r)/(ri — ro). Two initial value sets of Vq,p, and 7 provided in [1] 
and [44] were tested. The initial condition for 2D MHD Rotor problem is listed 
as follows 



spherical region center = (0.5,0.5), Tq = 0.1, ri — 0.115 
/ = (ri - r)/(ri - ro), (0 < x < 1) (0 < y < 1) 



(49) 
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In Fig. 11, we present images of the density, gas pressure of the two rotor 
problems computed with 512 x 512 grid points. The results are in excellent 
agreement with those presented in [1] and [44]. 




Fig. 11. Results of the density (top-left), gas pressure (top-right) of the first MHD 
rotor test at t = 0.15L, results of the density (bottom-left), gas pressure (bot- 
tom-right) of the second MHD rotor test at t = 0.295L, both computed with 512 x 
512 grid points. 

5.3 Three-dimensional blast wave problem 



The 3D version of MHD spherical blast wave problem was also tested. The 
condition is listed as follows [36] 



Vx 







< Vy 


> = < 


> 










(56) 



25 









< By 


> = < 


1/^3 > 


V ) 




1/V3 



10 inside the spherical region 
0.1 outside the spherical region 



P=l, 7 = 5/3 

spherical region center = (0.5,0.5,0.5), r = 0.1 
(0 < a; < 1) (0 < 7/ < 1) (0 < 2 < 1) 



(57) 



(58) 



(59) 



Fig. 12 and Fig. 13 show the results of 3D blast wave problem, which include 
the density, gas pressure, and magnetic pressure at t — O.IL and t — 0.2L 
shced along x-y plane at 2; = 0.5. The test was computed with 128 x 128 
X 128 grid points. Due to the scarcity of published 3D test results, we do 
not make direct contact with results presented in the literature here. Consid- 
ering only the u and B, the memory requirement of 256^ MHD problem is 
about 512MB GRAM for single precision and 1024MB GRAM for double pre- 
cision, respectively. If the storage of intermediate results such as Btempi '^temp, 
fluxtemp and F etc. (Sec Section 4.3) are considered, the amount of memory 
requirement will be about 2.25GB (single precision). As wc mentioned in Sec- 
tion 4.3, not all the capacity of GRAM can be used to store data arbitrarily. 
There are actually two GPUs on the GTX 295 and the 1.75 GB GRAM is 
the total amount of the GRAM shared by two GPUs, so that only less than 
1.75/2 = 0.875 GB GRAM can be used. As a result, the test of 3D problem 
with 256^ resolution are not able to be provided on a graphics card. 



6 Accuracy evaluation 



In MHD simulations, accuracy is always to be considered since the error may 

increase fast and make the simulation crashed if low precision is used for com- 
putation. Scientific computations such as MHD simulation mostly use double 
precision to reduce errors. In this section, the results generated by GPU-MHD 
using single precision and double precision modes are shown and compared. 

The difference between the results of double precision and single precision 
computation of the 512 x 1 x 1 one-dimensional Brio-Wu shock tube problem 
is shown in Fig. 14. Two curves are almost the same but there are actually 
some differences with the amount of error < ±10""^: 
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Fig. 12. Results of the density (top-left), gas pressure (top-right) and magnetic 
pressure (bottom) of 3D blast wave test at t = O.IL sliced along x-y plane at z = 
0.5 and computed with 128 x 128 x 128 grid points. 

In 2D cases, the absolute difference between the results of double precision 
and single precision computation of MHD Rotor test {t = 0.15L) and Orszag- 
Tang vortex test {t = 0.5L and t = l.OL) are shown in Fig. 15 and Fig 16, 
respectively. The double precision computation results of both tests are also 
shown in the left-hand side of these figures. 

For the MHD Rotor test, even the resulting image (left in Fig. 15) looks 
similar to the single precision resulting image (top-left of Fig. 11), the high 
differences at the dense region can be found. Experimental result shows that 
the maximum error is large than ±3.5 x 10~^. 

Fig. 16 shows the absolute difference between the results of double precision 
and single precision computation of Orszag-Tang test at t = 0.5L and t = l.OL. 
As the simulation time increases, the maximum error increases from about 
±8 X 10-5 to ±0.03. 
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Fig. 13. Results of the density (top-left), gas pressure (top-right) and magnetic 
pressure (bottom) of 3D blast wave test at t = 0.2L sliced along x-y plane at z = 
0.5 and computed with 128 x 128 x 128 grid points. 

Fig. 17 and Fig. 18 show the resulting images of the simulation using double 
precision and the contours of the absolute differences between the results of 
double precision and single precision computation of 3D blast wave test with 
128'^ grid points at t = O.IL and t = 0.2L. As it is a high dimension computa- 
tion in low resolution, the differences between them are clear. The number of 
grid points having higher difference value increases, and the amount of differ- 
ence is still kept as error < ±10~^. Small difference value makes the double 
precision resulting images (Fig. 17 and Fig. 18) looked similar to the single 
precision resulting images (Fig. 12 and Fig. 13). 

An important point can be realized that not only the grid points at the high 
density region has high difference value, but also the number of grid points 
having high difference value and the amount of the difference values are in- 
creasing along with the increase of the simulation time. Higher dimension is 
another factor to introduce noticeable differences between the computation 
results with different precisions because higher dimension means a grid point 
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Fig. 14. Result oi p double - P single (top), E double - E single (middle) and Pd^uble-P single 

(bottom) of ID Brio-Wu shock tube problem at t = 0.08L with 512 grid points 



has more neighbors and more neighbors need more computation steps in one 
time step. As a result the differences become more obvious. Therefore, for a 
long-term simulation, double precision computation is a must. 
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Fig. 15. Results of pdoubie (left) and \pdoubie - Psingie\ (right) of MHD rotor problem 
at t = 0.15L with 512^ grid points 

7 Performance measurements 



The performance measurements of the GPU and CPU implementations as 
well as the computation using double precision and single precision are carried 
out in this section. Different numbers of grid points and different dimensions 
were used in the performance tests. We run both GPU-MHD and Pen et 
a/.'s FORTRAN/CPU MHD code [29] to perform the simulations on a PC 
with Intel Core i7 965 3.20 GHz CPU, 6G main memory, running Microsoft 
Windows XP 64-bit Professional. The graphics card is NVIDIA Geforce GTX 
295 with 1.75G video memory. GPU-MHD was designed for three-dimensional 
problems, thus the dimensions are expressed in three-dimensional form in all 
tables. For ID test, ID Brio-Wu shock tube problem (see Section 5.1.1) was 
used. For 2D test, 2D Orszag-Tang problem (see Section 5.2.1) was used was 
used. For 3D test, 3D blast wave problem (see Section 5.3) was used. 

Table 1 reports the comparison of GPU-MHD and the FORTRAN/CPU code 
of ID test with different numbers of grid points. Basically there is only about 
10 times speedup (4096 x 1 x 1 case) since the number of grid points is small. 
And it should be realized that the amount of speedup is increased as long as 
the resolution is increased but dropped when the resolution reaches 512. It is 
because the "max threads per block" of the NVIDIA Geforce GTX 295 is 512, 
all the computations are handled within one block and a very high processing 
speed can be archived. 

Table 2 gives the comparison of GPU-MHD using single precision and double 
precision of ID test with different numbers of grid points. A similar speed 
drop happened both in single and double precision modes, but it occurred in 
different resolutions: 512 in single precision and 256 in double precision. This 
is not strange and not difficult to understand since the double precision has 
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Fig. 16. Results oi p double (left) and \ p double - Psingie\ (right) of Orszag-Tang problem 
at t = 0.5L (top) and t = l.OL (bottom) with 512^ grid points 

Table 1 

The performance results of ID test of FORTRAN/CPU and GPU-MHD at different 



resolutions 



Number of grid points 


FORTRAN/CPU 
(ms/step) 


GPU-MHD 
(ms/step) 


Speedup 


128 X 1 X 1 


10.9 


3.6 


3.0277 


256 X 1 X 1 


21.9 


4.0 


5.4750 


512 X 1 X 1 


43.8 


4.5 


9.7333 


1024 X 1 X 1 


87.5 


30.0 


2.9167 


2048 X 1 X 1 


173.4 


32.2 


5.385 


4096 X 1 X 1 


350.0 


36.4 


9.6154 



double size of data to be handled by the processors. Except the special case 
of 512 resolution, the processing speed in both modes are very closed. 
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Fig. 17. Results of Pdouble (top-left), Pdouble (bottom-left) and \pdouble - Psinglel 

(top-right), \pdoubie — Psinglel (bottom-right) of 3D blast wave problem at t = O.IL 
with 128'^ grid points 

The comparison of GPU-MHD and the FORTRAN/CPU code of 2D test 
with different numbers of grid points is presented in Table 3. In 2D case, a 
significant performance improvement is observed, especially when the numbers 
of grid points are 512^ are 1024^, a speedup of around 150 and around 200 is 
achieved, respectively. 

Table 4 presents the comparison of GPU-MHD using single precision and dou- 
ble precision of 2D test with different numbers of grid points. The significant 
performance difference is noticeable when the number of grid points is in- 
creased. However, it still keeps a ratio increasing slowly from 1.118 to 1.6218 
while the resolution increases from 128^ to 1024^. 

Table 5 shows the comparison of GPU-MHD and the FORTRAN/CPU code 
of 3D test with different numbers of grid points. The performance of GPU- 
MHD is faster than the FORTRAN/CPU code about 60 times and 84 times 
when the numbers of grid points are 64^ and 128^, respectively. 
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Fig. 18. Result of Pdouble (top-left), Pdouble (bottom-left) and \pdouble - Psinglel 

(top-right), \pdoubie — Psinglel (bottom-right) of 3D blast wave problem at t = 0.2L 
with 128^ grid points 

Table 5 shows the comparison of GPU-MHD using single precision and double 
precision of 3D test with different numbers of grid points. The ratio is 1.6020 
when the number of grid points is 64^, and is 1.7389 when the number of grid 
points is 128^. 

The performance tests show that when the number of grid points of the test 
problems is small, such as those in ID case, GPU-MHD can give a significant 
performance improvement. When the number of grid points increases, an ob- 
vious disparity of performance becomes clear, especially for multidimensional 
cases (see Table 3 and Table 5). For example, in Table 3, GPU-MHD even 
obtained a speedup of around 150 to 200 compared to the FORTRAN/CPU 
implementation when the number of grid points is 512^ or 1024^. The perfor- 
mance results show that CUDA is an attractive parallel computing environ- 
ment for MHD simulations. 

Computation using double precision on the current CPUs is known to be 
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Table 2 

The performaiiee results of ID test between single preeision and doTiblc precision of 
GPU-MHD at dill'ereul resolutions 



Number of grid points 


Double precision 

(ms/step) 


Single precision 

(ms/step) 


Ratio 


128 X 1 X 1 


3.9 


3.6 


1.0833 


256 X 1 X 1 


4.5 


4.0 


1.1250 


512 X 1 X 1 


29.5 


4.5 


6.5555 


1024 X 1 X 1 


31.0 


30.0 


1.0333 


2048 X 1 X 1 


33.0 


32.2 


1.0248 


4096 X 1 X 1 


39.1 


36.4 


1.0742 



Table 3 

The performance results of 2D test of FORTRAN/CPU and GPU-MHD at different 



resolutions 



Number of grid points 


FORTRAN/CPU 
(ms/step) 


GPU-MHD 
(ms/step) 


Speedup 


128 X 128 X 1 


901.6 


32.2 


28.0 


256 X 256 X 1 


3639.1 


44.8 


82.9236 


512 X 512 X 1 


14715.6 


94.0 


156.5489 


1024 X 1024 X 1 


58276.6 


295.1 


197.4809 



Table 4 

The performance resTilts of 2D test between single precision and double precision of 
GPU-MHD at different resolutions 



Number of grid points 


Double precision 
(ms/step) 


Single precision 

(ms/step) 


Ratio 


128 X 128 X 1 


35.8 


32.2 


1.1118 


256 X 256 X 1 


57.3 


44.8 


1.2790 


512 X 512 X 1 


142.3 


94.0 


1.5138 


1024 X 1024 X 1 


478.6 


295.1 


1.6218 



very low performance compared to single precision. However, in the perfor- 
mance comparison between single precision and double precision modes in 
GPU-MHD, the ratios of the processing speed between two modes show that 
GPU-MHD is efficient enough in double precision computation. Even when the 
number of grid points is 1024^ or 128^, computation using double precision 
still reaches 62.2% or 57.5% of the corresponding single precision processing 
speed. It is due to that GPU-MHD uses indexing operations instead of trans- 
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Table 5 

The performance results of 3D test of FORTRAN/CPU and GPU-MHD at different 



resoluLious 



Number of Grid points 


FORTRAN/CPU 

(ms/step) 


CUDA/GPU 

(ms/step) 


Speedup 


32 X 32 X 32 


739.1 


36.6 


20.1940 


64 X 64 X 64 


5521.9 


90.7 


60.8809 


128 X 128 X 128 


42331.8 


506.4 


83.5936 



Table 6 

The performance results of 3D test between single precision and double precision of 
GPU-MHD at different resolutions 



Number of grid points 


Double precision 

(ms/step) 


Single precision 
(ms/step) 


Ratio 


32 X 32 X 32 


44.6 


36.6 


1.2186 


64 X 64 X 64 


145.3 


90.7 


1.6020 


128 X 128 X 128 


880.6 


506.4 


1.7389 



posing the matrix. These operation also helps keeping the efficiency, while 
read-write operations of transposing the matrix with double precision data 
are much slower than those with single precision data. 



8 Visualization of the simulation results 

There is a need to visualize the MHD simulation data, for examples, Daum [7] 
developed a toolbox called VisAn MHD in MATLAB for MHD simulation data 
visuahzation and analysis. With the help of GPUs, Stantchev et al. [38] used 
GPUs for computation and visualization of plasma turbulence. In GPU-MHD, 
using the parallel computation power of GPUs and CUDA, the simulation 
results of one time step can be computed in dozens or hundreds milliseconds. 
According to the efficiency of GPU-MHD, near real-time visuahzation is able 
to be provided for ID and 2D problems. The motion or attributes of the 
magnetic fluid can be computed and rendered on the fly. So the changes of 
the magnetic fluid during the simulation can be observed in real-time. 

By adding the real-time visuahzation, the flow of GPU-MHD, Fig. 2 is ex- 
tended as Fig. 19: 

GPU-MHD provides different visualization methods for one-dimensional, two- 
dimensional and three-dimensional problems. 
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Sysrem Initialization (CUDA;'OpenGL) 



MHD Test/Problem Initialization 
(Initial conditions/boundary condiliotis) 



i 

Calculation of CFL condition and Und 

out II suitable dt 

l ime / = f + tit, time step +1 
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Fig. 19. The flow chat of GPU-MHD. 

To visualize one-dimensional problems for each time step, the simulation re- 
sults are copied to the CUDA global memory that mapped to the Vertex Buffer 
Object (VBO) [48]. For all grid points, one grid point is mapped to one vertex. 
The position of each grid point is mapped as the x-position of the vertex and 
the selected physical value (p, p, etc.) is mapped as the ^/-position of the vertex. 
Then a curve of these vertices is drawn. Since the VBO is mapped to CUDA 
global memory and simulation results are stored in GRAM, the copying and 
mapping operations are fast. Experimental result shows that GPU-MHD with 
real-time visualization can achieve 60 frame per second (FPS) in single preci- 
sion mode and 30 FPS in double precision mode. Fig. 20 shows two example 
images of ID visualizations using GPU-MHD. 



The operational flow of visualization of 2D problems is similar to that in ID 
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Fig. 20. ID real-time visualizations of the density (p) of Brio-Wu shock tube problem 
with 512 grid points using GPU-MHD. 

visualization. However, instead of Vertex Buffer Object (VBO), Pixel Buffer 
Object (PBO) [48] is used. For each time step, the simulation results are copied 
to the CUDA global memory that are then mapped to the PBO. For all grid 
points, one grid point is mapped to one pixel. The x and y position of each 
grid point are mapped as the corresponding x-position and the ^/-position of 
the vertex and the selected physical value (p, p, etc.) is mapped as the color of 
the pixel to form a color image. To render this color image, a transfer function 
is set to map the physical value to the color of the pixel and then the result- 
ing image is drawn. Similar to VBO, PBO is also mapped to CUDA global 
memory and the simulation results are stored in GRAM, so the copying and 
mapping operations are also fast and do not affect too much to the perfor- 
mance. Although the number of grid points in 2D problem is much larger 
than those in the one-dimension problem, the FPS still reaches 10 in single 
precision mode and 6 in double precision mode, still giving acceptable per- 
formance to the user. Fig. 21 shows two example images of 2D visualizations 
using GPU-MHD. 

However, visualization of 3D problem is different to ID and 2D problems. 
GPU-based volume visualization method [12] and texture memory (or video 
memory) are used. Unfortunately, the current version (Version 2.3) of CUDA 
does not provide the feature to copy the data from the CUDA global memory 
to texture memory directly, even both of them are in GRAM. On the other 
hand, texture memory is readable but is not rewritable in CUDA. So the sim- 
ulation results have to be coped to the main memory first, and then be coped 
to texture memory. In addition, the number of grid points is usually large 
compared to 2D problems and volume visualization techniques are somewhat 
time-consuming. As a result, GPU-MHD only gets 2 FPS in single precision 
mode and 1 FPS in double precision mode, and it is far from real-time. Never- 
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Fig. 21. 2D visualizations of the density (p) of Orszag-Tang vortex problem with 
512^ grid points using GPU-MHD. 

theless, we still get 10 FPS (single precision mode) and 6 EPS (double precision 
mode) for performing the simulation of problems with resolution of 64^ and 
about 20 FPS (single and double precision modes) for problems with reso- 
lution of 32^. Fig. 22 shows two example images of 2D visualizations using 
GPU-MHD. 




Fig. 22. 3D visualizations of the density {E) of 3D Blast wave problem with 128^ 
grid points using GPU-MHD. 

9 Conclusion and future work 

In this paper we present, to the author's knowledge, the first implementation of 
MHD simulations entirely on CPUs with CUDA, named GPU-MHD, to accel- 
erate the simulation process. A series of numerical tests have been performed to 
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validate the correctness of our code. Accuracy evaluation by comparing single 
and double precision computation results is also given, indicating that double 
precision support on GPUs is a must for long-term MHD simulation. Perfor- 
mance measurements of both single and double precision modes of GPU-MHD 
are conducted. These measurements show that GPU-MHD achieves speedups 
(in single precision mode) of about 10 (ID problems with 4096 grid points), 
200 (2D problems with 1024^ grid points), and 84 (3D problems with 128'^ grid 
points), respectively, compared to the corresponding serial CPU MHD imple- 
mentation. For double precision computation, GPU-MHD still can achieve 
about 60% speed of the corresponding single precision computation. In order 
to provide the user better understanding of the problems being investigated 
during the simulation process, we have extended GPU-MHD to support visual- 
ization of the simulation results. With GPU-MHD, the whole MHD simulation 
and visuahzation process can be performed entirely on GPUs. As NVIDIA has 
announced the next generation CUDA GPUs — Fermi [26], which provides 
8x speedups in double precision computation compared to the current gener- 
ation CUDA GPUs. GPU-MHD will benefit this advance and will give better 
performance on Fermi series GPUs, giving GPU-MHD useful for wide range 
applications in astrophysics or space physics. 

There arc two directions in our future work, firstly, we are going to extend 
GPU-MHD for multiple GPUs and GPU cluster [33] to fully exploit the power 
of GPUs. Secondly, we will investigate implementing other recent high-order 
Godunov MHD algorithms such as [19] and [41] on GPUs. These GPU-based 
algorithms will be served as the base of our GPU framework for simulating 
large-scale MHD problems in space weather modehng. 
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